CatRelate: A New Hierarchical Document Category Integration Algorithm by Learning Category Relationships
Abstract
We address the problem of integrating documents from a source catalog into a master catalog. Current approaches either treat it as a flat category integration problem, ignoring the useful hierarchy information in the catalog, or handle it hierarchically but without a rigorous model. In contrast, our method is based on correctly identifying relationships among categories, such as Match, Disjoint, SubConcept, SuperConcept, and Overlap, which are derived from the relations between sets in set theory. Compared with the traditional Match/NotMatch relationship in the literature, our approach defines category relationships more expressively. The relationships among categories are first learned probabilistically and then refined by considering the hierarchy context. Our preliminary experiments show that the method helps to identify category relationships correctly and thus increases the accuracy of document integration.
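To make the five relationships concrete, here is a minimal sketch of how they correspond to exact set relations between the document sets of two categories. It is not the paper's algorithm: the abstract says the relationships are learned probabilistically and refined using hierarchy context, whereas this sketch uses crisp set comparisons. The function name `category_relationship`, the document-ID representation, and the direction of SubConcept/SuperConcept (read from the source category's point of view) are assumptions made for illustration.

```python
def category_relationship(source_docs, master_docs):
    """Return the set-theoretic relationship of a source category to a master category.

    Both arguments are iterables of document IDs (a hypothetical representation).
    The direction is an assumption: "SubConcept" means the source category is
    strictly contained in the master category.
    """
    a, b = set(source_docs), set(master_docs)
    if a == b:
        return "Match"
    if a.isdisjoint(b):
        return "Disjoint"
    if a < b:  # proper subset
        return "SubConcept"
    if a > b:  # proper superset
        return "SuperConcept"
    return "Overlap"  # some shared documents, neither set contains the other


# Usage with made-up document IDs
print(category_relationship({"d1", "d2"}, {"d1", "d2", "d3"}))
print(category_relationship({"d1", "d4"}, {"d1", "d2", "d3"}))
```

In practice two catalogs rarely share the same documents, which is presumably why the relationships have to be estimated rather than computed directly as above.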
Similar resources
Hierarchical text categorization using fuzzy relational thesaurus
Text categorization is the task of assigning a text document to an appropriate category from a predefined set of categories. We present a new approach to text categorization by means of a Fuzzy Relational Thesaurus (FRT). FRT is a multilevel category system that stores and maintains an adaptive local dictionary for each category. The goal of our approach is twofold: to develop a reliable t...
Optimal sequencing during category learning: Testing a dual-learning systems perspective.
Recent studies demonstrate that interleaving the exemplars of different categories, rather than blocking exemplars by category, can enhance inductive learning (the ability to categorize new exemplars), presumably because interleaving affords discriminative contrasts between exemplars from different categories. Consistent with this view, other studies have demonstrated that decreasing between-categ...
A hierarchical K-NN classifier for textual data
This paper presents a classifier based on a modified version of the well-known K-Nearest Neighbors classifier (K-NN). The original K-NN classifier was adjusted to work with category representatives rather than training documents. Each category was represented by one document that was constructed by consulting all of its training documents and then applying feature selection so that only...
A Theoretical Framework for Web Categorization in Hierarchical Directories using Bayesian Networks
In this paper, we shall present a theoretical framework for classifying web pages in a hierarchical directory using the Bayesian Network formalism. In particular, we shall focus on the problem of multi-label text categorization, where a given document can be assigned to any number of categories in the hierarchy. The idea is to explicitly represent the dependence relationships between the differ...
Experiments with multi-label text classifier on the Reuters collection
Text categorization is the task of assigning a text document to an appropriate category from a predefined set of categories. We present an approach to hierarchical text categorization, a recently emerged subfield of the main topic. Here, documents are assigned to leaf-level categories of a category tree (called a taxonomy). The algorithm applies an iterative learning module that allow...